STATISTICAL TOOLS FOR DETERMINING FITNESS TO FLY


AN  OVERVIEW 

Military  personnel  and  other  groups  are  routinely  subject  to  regularly 
scheduled  physical  examinations  or  checkups.  Beyond  providing  the  personal 
benefits  of  continual  health  care,  the  checkups  also  serve  to  identify  high 
risk cases. If the subject is a flyer or is responsible for dangerous
equipment, a high probability of an incapacitating event may require a change
of assignment to a less hazardous one. The goal of this project is to use data
from regularly scheduled checkups to estimate the probability of an event such
as a heart attack.

The  task  considered  here  is  to  construct  a  mathematical  model  to  estimate 
the  probability  of  an  event.  This  model  would  allow  the  examining  physician 
to summarize the subject's medical history in a meaningful way. A high
probability of an event as computed by the model would be evidence of a high
risk case. A number of models have been created for estimating the probability
distribution of time to event, the so-called failure time. The most noticeable
contribution in the biostatistical literature has been Cox's (3) proportional
hazard model for possibly censored data using concomitant information.
It has been applied to cancer data with a good deal of success, and a number
of extensions of the model have appeared in the literature; cf. Breslow (2),
Cox (3), Peduzzi et al. (4), Taulbee (12), Prentice and Kalbfleisch (9).
However, there are qualitative differences between the populations of diagnosed
cancer patients and of generally healthy military personnel. In the former,
eventual  failure  occurs  in  a  large  proportion  of  the  cases;  in  the  latter,  the 
event  is  relatively  rare.  Also,  the  chance  of  loss  to  followup  is  fairly 
large  among  the  civilian  population,  while  loss  to  followup  is  less  of  a 
problem among military personnel, especially rated career officers. Finally,
the data needed to predict survival among cancer patients is usually first
collected  at  the  time  of  initial  diagnosis.  The  data  to  predict  events  among 
healthy  subjects  is  limited  to  that  routinely  gathered  during  the  periodic 
checkups.  These  differences  lead  us  to  propose  a  separate  model  which  we  have 
termed the periodic checkup predictive model. The model is a survival
distribution model similar both to Cox's proportional hazard model when there is
little  or  no  loss  to  followup,  and  to  logistic  discrimination  (Press  and 
Wilson  (10))  when  the  object  is  to  predict  the  occurrence  or  nonoccurrence  of 
an  event  during  a  fixed  interval. 

The  chief  motivation  behind  the  model  is  to  mimic  the  decision  process  of 
the  examining  physician  at  the  end  of  a  regularly  scheduled  checkup.  The 
physician  must  decide  on  the  strength  of  the  current  examination  findings  plus 
the  subject's  previous  medical  history  whether  there  is  sufficient  risk  of  an 
event  to  require  further  tests.  The  horizon  of  the  event  period  of  primary 
interest  is  the  time  of  the  next  regularly  scheduled  checkup.  At  that  time, 
new  data  will  be  available  that  may  change  the  estimate  of  risk. 

In  a  similar  manner,  the  model  used  here  estimates  the  survival  function 
of  an  individual  from  the  moment  of  his  last  checkup  until  the  time  of  his 
next  exam.  At  the  time  of  the  next  scheduled  checkup  a  new  assessment  will  be 
made. We indicate the time that the patient is at risk by $\tau$; thus $\tau = 0$ at
the time of the most recent checkup and $\tau = 1$ at the time of the next
scheduled checkup. All covariates including past time-dependent ones can be
considered to be fixed at $\tau = 0$ and to remain so until $\tau = 1$. Since an event
must occur in the interval between two successive checkups, $0 \le \tau_i \le 1$ for
all subjects suffering an event. The survivors, those who have never suffered
an event, have $\tau_i = 1$ for each time they survive over the period between
checkups without an event, at which time new data becomes available and
$\tau_i = 0$ again. Ideally the data is collected and analyzed each period so that changes
in  the  population  are  built  into  the  model.  In  practice,  a  number  of  years 
may  be  clustered  together  so  that  there  are  a  reasonable  number  of  events. 
Then  survivors  appear  repeatedly  with  expanded  data  sets  at  each  new  checkup. 
This leads to the problem of dependent sample members, but our research
indicates that the problem may be solved by subsampling without replacement.

The  chief  advantage  of  the  periodic  checkup  predictive  model  lies  in  the 
short horizon for the failure time. Loss to followup becomes less of a
problem since it is necessary only to establish that the subject survived until
the time of his next scheduled checkup. Reestimating the parameters each
period allows changes in the population to be quickly built into the model.
The model will more closely fit the observations since it need only predict
over a short period using recent data. Finally, the model is found to be
computationally easy to implement.

THE  PERIODIC  CHECKUP  PREDICTIVE  MODEL 
The  model  implemented  here  is  based  on  the  following  assumptions: 

(i)  each individual is examined annually or, more generally, at the end of
some  fixed  interval; 

(ii)  each  examination  consists  of  identical  tests  and  readings; 

(iii)  records  of  at  least  two  previous  examinations  are  available  at  the 
time  of  the  most  recent  examination  for  each  individual; 

(iv)  the  occurrence  of  an  event  (heart  attack,  say)  before  the  next  scheduled 
examination  is  relatively  rare; 

(v)  once  the  measured  covariates  such  as  age,  blood  pressure,  body  mass,  and 
certain  fixed  covariates  are  accounted  for,  all  individuals  are  equally 
at risk until the next scheduled examination.

These assumptions are implemented by a proportional hazard model with
constant base risk. Let $t$ denote the age of the subject at the time of his
last examination, and $\tau$ the future time, where $\tau = 0$ at the time of the last
examination. Let $z(t)$ be the time-dependent covariates for the last and two
previous examinations, and $x$ the time-independent covariates. Let $y \in R^m$,
$m > 1$, denote appropriate transformations of $z(t)$ and $x$ that have been found
to be informative about the chance of an event. The hazard rate
$\lambda(\tau, z(t), x)$ is assumed to have the form

$$\lambda(\tau, z(t), x) = \lambda_0 e^{\beta^T y} \quad \text{for } 0 \le \tau \le 1 \qquad (1)$$

and $P[\tau = 1] = \exp(-\lambda_0 e^{\beta^T y})$, where $T$ indicates vector transpose and the
parameter vector $\beta \in R^m$. This model is a version of one due to Taulbee
(12) and as such is a generalization of Cox's model for two cancer
populations' survival functions. It holds that the base population risk $\lambda_0$ is
constant over the one period time interval until the next examination. At
that time, the estimate for $\lambda_0$ is updated to account for any change in the
population's prior risk of an event. For example, in recent years, it appears
that fewer heart attacks are occurring among middle-aged men, suggesting that
$\lambda_0$ should be successively lowered. The model is also a generalization of the
Weibull base hazard rate model. To see this, note that the Weibull model
hazard has the form

$$\lambda(t, z(t), x) = \lambda_0 \alpha (\lambda_0 t)^{\alpha - 1} e^{\gamma_1^T z + \gamma_2^T x}$$

$$= (\lambda_0^{\alpha} \alpha) \exp[(\alpha - 1) \log t + \gamma_1^T z + \gamma_2^T x] = \lambda_0^{\alpha} \alpha \, e^{\beta^T y} \qquad (2)$$

where $y$ transforms age variable $t$ to the variable $\log t$. The difference
between equations 1 and 2 is that equation 1 gives the hazard in terms of
future time $\tau$, taking age at last checkup and the recorded covariates to be
fixed  until  the  next  examination.  On  the  other  hand,  equation  2  expresses  the 
hazard  for  age  t,  and  hence  does  not  use  age  as  a  covariate.  The  advantage 
of  equation  1  over  equation  2  is  that  equation  1  requires  fitting  the  data 
only  over  the  interval  until  the  next  scheduled  checkup  while  equation  2 
requires  a  data  fit  essentially  from  birth  until  the  age  of  an  event  or 
censoring.  A  second  technical  advantage  is  that  in  equation  1  we  may  estimate 
parameters  by  the  method  of  maximum  likelihood  while  it  can  be  shown  that  no 
maximum  likelihood  estimates  exist  for  the  parameters  of  equation  2,  and  other 
less  developed  techniques  must  be  utilized. 

To find the maximum likelihood estimates (MLE) for $\lambda_0$, $\beta$ from equation
1, we note that

$$\lambda(\tau, z(t), x) = \lambda(\tau, y), \text{ say}$$

$$= f(\tau, y)/(1 - F(\tau, y)).$$

This implies that the survival function is

$$S(\tau, y) = \exp\{-\lambda_0 e^{\beta^T y} \tau\}.$$

For a sample of size $n$, suppose that $r$ individuals, indexed by
$1, \ldots, r$, have failed (suffered an event) prior to the time of their next
scheduled checkup while $n - r$, indexed by $r+1, \ldots, n$, have survived through the
time interval. Scale $\tau$ to equal 0 at the time of last checkup, and to 1
at the time of next scheduled checkup. Thus the failure times are $0 \le \tau_j \le 1$
for $j = 1, \ldots, n$, with $\tau_j = 1$ for $j = r+1, \ldots, n$. The likelihood function for
the sample is

$$L = \lambda_0^r \exp\Big\{ \sum_{j=1}^{r} \beta^T y_j - \lambda_0 \sum_{i=1}^{n} \tau_i e^{\beta^T y_i} \Big\}.$$

Hence the log likelihood

$$\ell = r \log \lambda_0 + \sum_{j=1}^{r} \beta^T y_j - \lambda_0 \sum_{i=1}^{n} \tau_i e^{\beta^T y_i}.$$

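Taking the partial derivative of $\ell$ with respect to $\lambda_0$, and setting it equal
to 0, we find the MLE for $\lambda_0$ to be

$$\hat{\lambda}_0 = r \Big/ \sum_{i=1}^{n} \tau_i e^{\beta^T y_i}. \qquad (3)$$

Substituting $\hat{\lambda}_0$ into $\ell$ we have the log likelihood

$$\ell = r \log r - r - r \log \sum_{i=1}^{n} \tau_i e^{\beta^T y_i} + \sum_{j=1}^{r} \beta^T y_j. \qquad (4)$$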

To find the MLE of $\beta$ from $\ell$, we take the gradient of $\ell$,
$\partial \ell / \partial \beta = h(\beta)$, say. We find a convenient matrix form for $h(\beta)$: let $Y$ be
the $n \times k$ matrix of sample points for $k$ transformed covariates observed for
$n$ subjects. Let $P$ be the $n$-dimensional column vector with elements

$$p_i = \tau_i e^{\beta^T y_i} \Big/ \sum_{l=1}^{n} \tau_l e^{\beta^T y_l}. \qquad (5)$$

Let $\Delta$ be the $n$-dimensional column vector of failure indicators

$$\delta_i = 1 \text{ if subject } i \text{ failed}, \qquad \delta_i = 0 \text{ if subject } i \text{ survived}.$$

Then

$$h(\beta) = Y^T (\Delta - rP). \qquad (6)$$

We find the zeros for this set of $k$ simultaneous equations by the
Newton-Raphson technique. Let $D$ be the $n \times n$ diagonal matrix with diagonal
elements $p_i$. Then the $n \times k$ matrix

$$\partial P / \partial \beta = (D - PP^T) Y.$$

Hence

$$\partial h(\beta) / \partial \beta = -r Y^T (D - PP^T) Y. \qquad (7)$$

The Newton-Raphson method iteratively solves for $\hat{\beta}$ by updating values
$\beta_a$ to $\beta_{a+1}$ where

$$\beta_{a+1} = \beta_a + [Y^T (D - PP^T) Y]^{-1} Y^T (\Delta / r - P). \qquad (8)$$
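
The iteration in equations 3, 5, and 8 can be sketched in a few lines of Python. This is only an illustrative sketch, not the program described in this report: the function names (`solve`, `weights`, `fit`), the starting value of zero, and the eight-subject data set are all assumptions made up for the example.

```python
import math

def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    k = len(b)
    M = [list(A[i]) + [b[i]] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda i: abs(M[i][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for i in range(c + 1, k):
            f = M[i][c] / M[c][c]
            for j in range(c, k + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (M[i][k] - s) / M[i][i]
    return x

def weights(Y, tau, beta):
    """Vector P of equation 5: p_i proportional to tau_i * exp(beta^T y_i)."""
    w = [t * math.exp(sum(b * v for b, v in zip(beta, y)))
         for t, y in zip(tau, Y)]
    total = sum(w)
    return [wi / total for wi in w]

def fit(Y, tau, delta, beta0, max_iter=50, tol=1e-10):
    """Newton-Raphson updates of equation 8, then lambda_0 from equation 3."""
    n, k = len(Y), len(beta0)
    r = sum(delta)
    beta = list(beta0)
    for _ in range(max_iter):
        p = weights(Y, tau, beta)
        mean_p = [sum(p[i] * Y[i][a] for i in range(n)) for a in range(k)]
        mean_f = [sum(delta[i] * Y[i][a] for i in range(n)) / r
                  for a in range(k)]
        score = [mean_f[a] - mean_p[a] for a in range(k)]  # Y^T (Delta/r - P)
        # Y^T (D - P P^T) Y, written as a weighted covariance matrix
        G = [[sum(p[i] * Y[i][a] * Y[i][b] for i in range(n))
              - mean_p[a] * mean_p[b] for b in range(k)] for a in range(k)]
        step = solve(G, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    lam0 = r / sum(t * math.exp(sum(b * v for b, v in zip(beta, y)))
                   for t, y in zip(tau, Y))
    return beta, lam0

# Hypothetical data: two failures (tau < 1) and six survivors (tau = 1).
Y = [[2.0], [1.0], [0.0], [1.0], [2.0], [3.0], [0.5], [1.5]]
tau = [0.3, 0.7, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
delta = [1, 1, 0, 0, 0, 0, 0, 0]
beta_hat, lambda0_hat = fit(Y, tau, delta, [0.0])
```

At convergence the weighted covariate mean $Y^T P$ should match the failure-group average $Y^T \Delta / r$, which gives a quick check on any implementation.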

There is a heuristic interpretation for some of these equations. In
equation 3, $\hat{\lambda}_0$ is the ratio of the average number of failures, $r/n$, to the
average value of the covariate effect $\tau \exp(\beta^T y)$, namely,
$\sum_{i=1}^{n} \tau_i e^{\beta^T y_i} / n$, for the given sample of size $n$. Thus $\hat{\lambda}_0$ plays the
combined role of providing an estimate for the population failure rate between
examinations plus a centering estimate that leaves the $y_i$ invariant under
uniform change of location. This role is analogous to that of the constant
term in discriminant analysis which incorporates both prior probabilities and
an overall mean. Setting $h(\beta)$ equal to 0 in 6 and dividing by $r$ shows that the
MLE $\hat{\beta}$ occurs when

$$\bar{y}_r = Y^T P = E_P y, \text{ say},$$

since each $p_i \ge 0$ and $\sum_{i=1}^{n} p_i = 1$. That is, the maximum likelihood occurs
when the expected value for the covariates computed by using the $\beta$ in
equation 5 equals the sample average for the failure group. If the $y_i$ display
a general shift between the failure and the survival groups, then the $\beta$ will
tend to reflect this shift, to give, in general, positive values for $\beta^T y_i$
in the failure group and negative values in the survival group. This shift
will be magnified in cases where $\tau_i$ is small, and hence subjects who have
events soon after their checkup will be weighted more heavily in determining
$\beta$. Equation 8 will be recognized as weighted least-squares regression to
compute the increment in $\beta$ at each iteration.

There is also a close resemblance between Cox's (3) proportional hazard
model for life tables from censored data and the model introduced here.
Equation 4 shows that the log likelihood of our model, up to a constant term in
$r$, is

$$\sum_{j=1}^{r} \Big\{ \beta^T y_j - \log\Big[ \sum_{i=1}^{n} \tau_i e^{\beta^T y_i} \Big] \Big\}. \qquad (9)$$

In our notation, the log likelihood of the Cox model (3) becomes

$$\sum_{j=1}^{r} \Big\{ \beta^T y_j - \log\Big[ \sum_{i \in R(\tau_j)} e^{\beta^T y_i} \Big] \Big\} \qquad (10)$$

where $R(\tau_j)$ is the index set of survivors at failure time $\tau_j$. Let us
assume that $0 < \tau_1 < \tau_2 < \cdots < \tau_r \le 1$ and recall that $\tau_i = 1$, $i = r+1, \ldots, n$.
Also in the application to Air Force flyers it is reasonable to assume that no
censoring occurs, since very few of the subjects, if any, will be lost to
followup over the course of the period between checkups. Then $\{r+1, \ldots, n\}$ will
be contained in $R(\tau_j)$ for each $j$, $j = 1, \ldots, r$. Let
$r(\tau_j) = R(\tau_j) - \{r+1, \ldots, n\}$, the index set of failing subjects who survive
past $\tau_j$; if there are no ties among the failure times,

$$r(\tau_j) = \{j+1, \ldots, r\}.$$

The difference between expressions 9 and 10 is then

$$\sum_{j=1}^{r} \Big\{ \log\Big[ \sum_{k \in r(\tau_j)} e^{\beta^T y_k} + \sum_{i=r+1}^{n} e^{\beta^T y_i} \Big] - \log\Big[ \sum_{i=1}^{r} \tau_i e^{\beta^T y_i} + \sum_{i=r+1}^{n} e^{\beta^T y_i} \Big] \Big\}.$$

We see that the key distinction between the two models in the application to
periodic checkups of Air Force flyers is that the Cox model uses only the
information that flyer $j_2$ survived flyer $j_1$ if $\tau_{j_1} < \tau_{j_2}$, $1 \le j_1 < j_2 \le r$,
while the periodic checkup model uses the actual time after checkup to weight
the $e^{\beta^T y_i}$. In practice we suspect that the two models will lead to similar
estimates of the $\beta$. Also because the number of failures per year, $r$, is
small relative to the sample size, $n$, and because loss to followup is
negligible, the Kaplan-Meier (6) estimates of $\lambda_0(\tau)$, the base hazard rate, will
be nearly constant in our application. For these reasons, plus the
mathematical tractability of the full likelihood technique, we have opted for
the model proposed here over the Cox model.

IMPLEMENTATION OF THE MODEL

Five parts to the implementation of the periodic checkup model were
addressed before creating the actual program: (i) the use of a subsample from
the surviving population, (ii) transformations and selecting reexpressions of
the data, (iii) calculation of the initial estimate, (iv) estimation of the
parameters by the Newton-Raphson procedure, and (v) verification of the
procedure by reusing the subsample. These will be considered in turn.

Use  of  a  Subsample 

Because of the rarity of the event, perhaps of the order of two heart
attacks per year for each 1000 at risk, it will be necessary to use a large
sample size over a number of years to find a useful number of cases. The
mixture of years makes the procedure less sensitive to changes in the
population, such as the possible decrease in heart attacks in recent years.
However, without a large program to collect data each year, it will have to be
assumed that there is little or no change in the population at risk over a
period sufficiently long to gather a reasonably large number of events.

Let us consider the example of the data set gathered by considering
approximately 3000 flyers at risk over the period 1974-1978. To assure that
there were at least two checkups previous to the checkups that began the risk
period, the risk period was taken to be the two years 1976-1978. During this
period, eight of the 3000 were admitted to hospitals with the primary
diagnosis of acute myocardial infarction, an average of 1.33 per 1000 per year.
Suppose that the 1976 and 1977 risk years are both used. The presence of a
1978 checkup merely indicates that the subject was not lost to followup during
the 2-year risk period 1976-1978. Then the total number of risk cases, using
both years, is roughly 6000. That is, in the direct model, 6000 cases need to
be analyzed to find eight events. The proposed alternative was to sample
systematically 20% of the 3000 from each of two strata based on age and to use
the second-to-last checkup plus the two previous to that one, usually 1976 to
1977, only. When the small number of incomplete cases were eliminated, 553
subjects remained. However, all eight event cases would be used. To check
the precision of the estimates, we compare the ratio of the two standard
deviations of the estimates from the samples. We find

$$\frac{\sigma [(1/8) + (1/561)]^{1/2}}{\sigma [(1/8) + (1/6000)]^{1/2}} = \frac{.3561}{.3538}.$$


We conclude that only negligible gains in precision are possible using all
6000 risk cases. To check the bias of the subsample method, we assume that
each member of the stratified sample represents ten "identical clones" in the
full sample. This assumption appears to be reasonable for this large
subsample, suggesting that a relatively small nine-member neighborhood can be
constructed in the full sample for most of the subsampled cases. It is similar
to the assumptions in stratified sampling. Then the maximum likelihood
estimate, $\hat{\beta}$, for $\beta$ from the subsample of size $n$ satisfies

$$r^{-1} \sum_{j=1}^{r} y_j = \Big( \sum_{i=1}^{r} y_i \tau_i e^{\beta^T y_i} + \sum_{i=r+1}^{n} y_i e^{\beta^T y_i} \Big) \Big/ \sum_{i=1}^{n} \tau_i e^{\beta^T y_i}. \qquad (11)$$

The maximum likelihood estimate from the "cloned group" satisfies

$$r^{-1} \sum_{j=1}^{r} y_j = \Big( \sum_{i=1}^{r} y_i \tau_i e^{\beta^T y_i} + 10 \sum_{i=r+1}^{n} y_i e^{\beta^T y_i} \Big) \Big/ \Big( \sum_{i=1}^{r} \tau_i e^{\beta^T y_i} + 10 \sum_{i=r+1}^{n} e^{\beta^T y_i} \Big)$$

$$= \Big( 10^{-1} \sum_{i=1}^{r} y_i \tau_i e^{\beta^T y_i} + \sum_{i=r+1}^{n} y_i e^{\beta^T y_i} \Big) \Big/ \Big( 10^{-1} \sum_{i=1}^{r} \tau_i e^{\beta^T y_i} + \sum_{i=r+1}^{n} e^{\beta^T y_i} \Big). \qquad (12)$$

This suggests that if $n$ is sufficiently large relative to $r$, the
effect of the first $r$ terms in the numerator and denominator of the
right-hand side of equation 12 is small and the two maximum likelihood
estimators of $\beta$ will be very similar. Here 8 over 553 is roughly 1 in 70, and
the 8 failing cases do not display data dramatically different from the 553
control cases. We conclude that the $\beta$ estimates from the subsample are both
reasonably accurate and precise enough to justify the use of the subsample approach.
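
The precision comparison between the subsample and the full set of risk cases is simple arithmetic and can be checked directly. The short sketch below assumes only the counts quoted in the text (8 events, 561 subsample cases, 6000 risk cases); the function name `relative_sd` is made up for the illustration, and the common factor $\sigma$ is omitted because it cancels in the ratio.

```python
import math

def relative_sd(events, cases):
    """Standard deviation factor [(1/events) + (1/cases)]^(1/2)."""
    return math.sqrt(1.0 / events + 1.0 / cases)

subsample = relative_sd(8, 561)    # subsample of the risk cases
full = relative_sd(8, 6000)        # all risk cases over both years
ratio = subsample / full           # loss of precision from subsampling
print(round(subsample, 4), round(full, 4))
```

The ratio is within one percent of 1, which is the sense in which the gain from analyzing all 6000 cases is negligible.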

The parameter estimate that will display bias (though not lack of
precision) will be $\hat{\lambda}_0$, the MLE for the base hazard rate constant $\lambda_0$. Here

$$\hat{\lambda}_0 = r \Big/ \sum_{i=1}^{n} \tau_i e^{\hat{\beta}^T y_i}$$

is composed of the average rate of events in the subsample $r/n$ and the
constant term for the covariate coefficients $e^{\beta^T y}$. But the average rate for
the population is not 8/561 but rather 8/6000. This is easily corrected by
substituting .75, the average hazard rate for the subsample, for $r$ in $\hat{\lambda}_0$.
This may be verified by noting that

$$P[T > 1 \mid y] = e^{-\lambda_0 e^{\beta^T y}}$$

is typically .995 or larger. Then we have the approximation

$$\sum_{i=1}^{n} P[T_i \le 1 \mid y_i] = \sum_{i=1}^{n} [1 - \exp\{-\lambda_0 e^{\beta^T y_i}\}]$$

$$\approx \sum_{i=1}^{n} \lambda_0 e^{\beta^T y_i} = \sum_{i=1}^{n} r e^{\beta^T y_i} \Big/ \sum_{l=1}^{n} \tau_l e^{\beta^T y_l}.$$

Thus the expected number of failures in the subsample roughly equals the
numerator of $\hat{\lambda}_0$; that is, $r$ should equal .75. This correcting factor for $\hat{\lambda}_0$
is necessary when the event in question is so rare that a stratified
subsampling technique is mandated, but inference of survival probabilities
involves the entire risk set.
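
The correction to the base hazard can be sketched as follows. This is an illustrative sketch under stated assumptions, not the program's code: the function names are hypothetical, `eta` stands for the fitted values $\beta^T y_i$, and the all-zero `eta` used in the example is a made-up placeholder that makes the arithmetic easy to follow.

```python
import math

def corrected_lambda0(expected_failures, tau, eta):
    """Equation 3 with r replaced by the expected number of failures."""
    return expected_failures / sum(t * math.exp(e) for t, e in zip(tau, eta))

def event_probability(lambda0, eta_i, horizon=1.0):
    """P[T <= horizon | y] = 1 - S(horizon, y) under the constant hazard."""
    return 1.0 - math.exp(-lambda0 * math.exp(eta_i) * horizon)

# Expected failures in a subsample of 561 at the population rate 8/6000.
expected = 561 * 8 / 6000          # about .75

# Placeholder fit: every subject has beta^T y = 0 and tau = 1.
tau = [1.0] * 561
eta = [0.0] * 561
lambda0 = corrected_lambda0(expected, tau, eta)
p_event = event_probability(lambda0, 0.0)
```

With these placeholder values the corrected base hazard reduces to the population rate 8/6000, and the event probability for a subject is slightly below that rate, as the approximation $1 - e^{-x} \approx x$ suggests.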


Transformations of the Data

One of the chief advantages of the periodic checkup model with its
constant update of the survival probability is that the time-dependent data can
be considered to be fixed at time $\tau = 0$. Following Frank (5) we feel that
three checkups provide sufficient information about the subject's state to
limit our consideration only to those three most recent examinations. Three
orthogonal time-series transformations of the data check for constant, linear,
and quadratic trends in the data: (1 1 1), (-1 0 1), and (1 -2 1). We
considered these as representatives of general classes of time-dependent
transformations. The first class or block of constant transformations presently
built into the program has the coefficient vectors: (1 1 1), (0 0 1), (0 1 0),
(1 0 0), and (1 2 4). The second or linear block is composed of (0 -1 1),
(-1 0 1), (-1 1 0), and (-1 -1 2). The final block has the single quadratic
coefficient vector (1 -2 1). Each coefficient vector $(a_1 \; a_2 \; a_3)$ is used to
find the corresponding trend in the data for each scalar variable. Let $x_0$,
$x_{-1}$, $x_{-2}$ represent the value of the covariate at the last and most recent
checkup, at the previous one, and at the second to last examination. Then we
compute the inner product $a_1 x_{-2} + a_2 x_{-1} + a_3 x_0$ to get a transformed scalar
variable $y$. This transformation $y_i$ is computed for all failing subjects,
$i = 1, \ldots, r$, and surviving subjects, $i = r+1, \ldots, n$. The value of $y$ as a
discriminator is determined by computing a $t$ statistic for each transformation
$y$ for the two samples of failures and survivors. The variance estimate is
derived from the covariance matrix for each variable over the three
examinations from the surviving group since it is possible that the failing
group may not have identically distributed cases, cf. Shea (11). Since $r$ is
small relative to $n$, the effect is minimal in any case. Then

$$t = (\bar{y}_r - \bar{y}_{n-r}) [(1/r + 1/(n-r)) (a^T S a)]^{-1/2}$$

where $a^T = (a_1, a_2, a_3)$; $\bar{y}_r$ and $\bar{y}_{n-r}$ are sample averages for transformed
variable $y$, and $S$ is the autocovariance matrix for the variable under
consideration from the surviving sample. Since the statistic has $n - r = 553$ degrees
of freedom for the denominator, and $r = 8$, it may be safely assumed that $t$ is
close to having the standard normal distribution. Therefore any $t$ value
greater than one suggests that the chance is less than one-third that the two
populations have identical mean values for this transformation.
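
The screening statistic for a single coefficient vector can be sketched as below. The sketch is illustrative only: the function names are hypothetical, each subject is assumed to be stored as a triple $(x_{-2}, x_{-1}, x_0)$, and the six blood-pressure-like records are invented for the example.

```python
import math

def mean(values):
    return sum(values) / len(values)

def contrast(a, x):
    """Inner product a_1*x_{-2} + a_2*x_{-1} + a_3*x_0 for one subject."""
    return sum(ai * xi for ai, xi in zip(a, x))

def covariance(rows):
    """3x3 sample covariance of one variable over the three examinations."""
    n = len(rows)
    m = [mean([row[j] for row in rows]) for j in range(3)]
    return [[sum((row[i] - m[i]) * (row[j] - m[j]) for row in rows) / (n - 1)
             for j in range(3)] for i in range(3)]

def t_statistic(a, failures, survivors):
    """t = (ybar_r - ybar_{n-r}) [(1/r + 1/(n-r)) a^T S a]^(-1/2)."""
    r, s = len(failures), len(survivors)
    S = covariance(survivors)                 # survivors only, as in the text
    a_S_a = sum(a[i] * S[i][j] * a[j] for i in range(3) for j in range(3))
    diff = (mean([contrast(a, x) for x in failures])
            - mean([contrast(a, x) for x in survivors]))
    return diff / math.sqrt((1.0 / r + 1.0 / s) * a_S_a)

# Hypothetical readings (x_{-2}, x_{-1}, x_0) for one variable.
survivors = [(120, 122, 121), (130, 128, 135), (110, 115, 112), (125, 125, 130)]
failures = [(140, 150, 160), (135, 140, 150)]
t_linear = t_statistic((-1, 0, 1), failures, survivors)
```

Running the same function over all ten coefficient vectors and each recorded variable would produce the table of $t$ values from which the transformations are selected.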

To select from among the thirty $t$ values computed, we imposed some
prior constraints. First it was decided that age and at least one
transformation of each variable would be included. This reflects the belief that age
and each variable observed are recorded because experience has related each of
these to the chance of an event. The transformations with highest $t$ values
are employed subject to these constraints until from six to nine such
transformations have been selected. In the sample run, the maximum number of
time-dependent data transformations was set to six.

The variables are modeled to be log-linear in the hazard rate and
log-log-linear in the survival function. It is natural to ask if a reexpression of
the data would serve to separate the two samples better. Two reexpressions
were tried by taking the natural log and the inverse of each raw data value.
These were motivated by the fact that most variables considered were ratios
such as mm Hg/cm² or kg/cm², and also these reexpressions have traditionally
been considered very fruitful in other applications; cf. Box et al. (1). No
distinctly better separations of the data were found under either reexpression
except for a slight improvement from using the logarithm of the body mass
index (BMI). The improvement was not deemed sufficiently large to justify the
loss of flexibility in the program introduced by taking the logarithm of one
variable while the others were unchanged. Therefore no reexpression of data
is built into the program, but an investigator may, if he chooses, reexpress
data before entering it.

The option to include new variables (e.g., smoking and family history)
which are known risk factors is included in the program. These variables
specified by the user together with the temporal contrasts selected by the
above $t$-values would then be used to construct estimates of $\lambda_0$, $\beta$ and $S(\tau)$.

Calculation  of  the  Initial  Estimate 

Formally, the iterative procedure for estimating the $\beta$ is similar to
that for estimating logistic discriminant function coefficients; cf. Press and
Wilson (10). For this reason, the linear discriminant coefficients,

$$\beta_0 = S^{-1} (\bar{X}_r - \bar{X}_{n-r}),$$

where $\bar{X}_r$ and $\bar{X}_{n-r}$ are the average covariate vectors for those who failed and
did not fail and $S$ is the pooled covariance matrix of the covariate vectors,
were used as the initial values for the iterative solution of the maximum
likelihood equations. The sample run verified that $\beta_0$ was a good initial
value. Since $\beta_0$ is basically independent of the sample size, this close
approximation between $\beta_0$ and $\hat{\beta}$ verified that the effect of using a subsample
rather than the full sample is probably negligible.

The Newton-Raphson Procedure

The Newton-Raphson procedure for finding the MLE is based on the
assumption that the zeros of the first derivative provide the global maximum of the
likelihood. Since the log likelihood is differentiable everywhere on the
domain of $\beta$, since there is a unique critical point, and since the Jacobian
of the derivative, the matrix of second-order derivatives, is the negative of
a positive definite matrix, the Newton-Raphson procedure indeed leads to the
unique MLE. Moreover, the matrix of second-order derivatives evaluated at
$\hat{\beta}$ is the negative of the observed Fisher information matrix, so its
inverse, denoted by GINV in the final iteration of the program, estimates the
covariance matrix of $\hat{\beta}$ up to sign.

Verification  by  Sample  Reuse 

The program has been written to apply the $\hat{\beta}$ and $\hat{\lambda}_0$ (corrected) MLE to
the subsample cases. This allows the investigator to decide on the apparent
error of a procedure that predicts an event if the probability $P[T < 1 \mid z(t)]$
is greater than $p_0$, say, and predicts no event otherwise. However, sample
reuse leads to a favorable bias on the error of the procedure; cf. Lachenbruch
(7). A better idea of the error can be computed from the bootstrap methods
summarized by Efron (4). However, in initial trials such as this, the
apparent error provides some idea of the usefulness of the procedure. The
sample run here provides the following estimates of false positives (survivors
who would have been predicted to fail during the interval between their last
recorded checkups, usually the year 1977-1978) and false negatives (acute
myocardial infarction patients who would not have been considered at risk) for
various values of $p_0$.

  p_0      False + (% of 553)    False - (% of 8)

 .0012           39.2                   0.0
 .0014           33.3                  12.5
 .0016           25.9                  25.0
 .0018           21.5                  37.5
 .0020           17.2                  50.0
 .0025           10.1                  62.5
 .0030            6.9                  87.5
 .0035            4.0                 100.0

This sample reuse estimate suggests that if $p_0 = .0016$ is used, only
one-fourth of the survivors and one-fourth of the event cases would be
misclassified. However, even if this optimistic estimate is true, hundreds of
false positives would appear among the 3000 subjects at risk each year. We
conclude that the information provided by the data presently available,
namely, systolic and diastolic blood pressures, age, and body mass index, is
still insufficient to allow the computed probability to be anything more than
a convenient summary statistic to the examining physician. As more
significant variables are added to the covariate history (e.g., smoking behavior,
family history of coronary heart disease, triglyceride and lipid readings,
etc.) one would expect to see improved discrimination results for the
procedure.
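
The apparent error rates tabulated above follow from a simple threshold rule, which can be sketched as below. This is an illustrative sketch, not the program's routine: the function name and the six computed probabilities are invented, and the rule assumed is "predict an event when the probability exceeds $p_0$".

```python
def apparent_error(survivor_probs, failure_probs, p0):
    """Percent false positives and false negatives for the threshold p0."""
    false_pos = sum(p > p0 for p in survivor_probs) / len(survivor_probs)
    false_neg = sum(p <= p0 for p in failure_probs) / len(failure_probs)
    return 100.0 * false_pos, 100.0 * false_neg

# Hypothetical computed event probabilities for a handful of subjects.
survivor_probs = [0.0010, 0.0020, 0.0030, 0.0005]
failure_probs = [0.0017, 0.0015]
fp_pct, fn_pct = apparent_error(survivor_probs, failure_probs, 0.0016)

# Scale of the problem at p0 = .0016: 25.9% of roughly 3000 subjects a year.
yearly_false_positives = round(0.259 * 3000)
```

Sweeping `p0` over a grid of thresholds reproduces the trade-off seen in the table: lowering the threshold removes false negatives at the cost of flagging a larger share of the survivors.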

PROGRAM  DESCRIPTION 

Data Files

Three input data files are needed to run the program: INPUT, DATAS,
and DATAF.

1.  'INPUT' file: contains some constants needed to run the program.

(a)  Number of Cards in file: 13

(b)  Layout of Card 1:

     Field  Length  Type     Variable

       1      8     Real     EFAIL: Average # of failures for size of
                                    control group
      11      2     Integer  CYFAR: Year of last exam
      13      4     Integer  NVAR:  # of time-dependent variables to select
      17      4     Integer  NAV:   # of time-independent variables to select
      21      8     Real     XINC:  Increment of frequency table

(c) 

NVAR  must  be  between  5  & 

9,  and  NAV 

must  be  between  083. 

Also  the  sum  must  be  less 

than  or  equal  to  9. 

(d) 

Layout 

of  Cards 

2  -  11: 

Field 

Length 

IZEi 

Variabl e 

1 

4 

Real 

A(l,D 

Weight  of  1st  year 

5 

4 

Real 

A( 1 .2) 

Weight  of  2nd  year 

9 

4 

Real 

A( I ,3) 

Weight  of  3rd  year  (year 
prior  to  last  exam) 

I  =  1,. 

...10 

(e) 

Layout 

of  Card  12: 

Field 

Length 

Type 

Variable 

1 

4 

Integer 

BB(1): 

boundary  of  1st  block 

5 

4 

Integer 

BB(2) : 

boundary  of  2nd  block 

9 

4 

Integer 

BB(3) : 

boundary  of  3rd  block 

20 


1 


(f)  Layout  of  Card  13: 


Field 

Length 

Me 

Variable  1 

1 

4 

Char. 

Name(l) : 

1 

name  of  1st  time-  : 

dependent  variable 

i 

5 

4 

Char. 

Naine(2) : 

name  of  2nd  time-  « 

dependent  variable 

9 

4 

Char. 

Name(3) : 

name  of  3rd  time-  , 

dependent  variable 

13 

4 

Char. 

Name(4) : 

Always  AGE 

(g)  A  copy  of  sample  data  is  included. 
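
The Card 1 layout can be read by slicing fixed columns. The Python sketch
below is an illustration only, not part of the original Fortran program: it
assumes that the Field column gives the 1-based starting column of each item,
and the card image it parses is made up rather than taken from the sample
data. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Card1:
    efail: float   # average number of failures for the control group size
    cyfar: int     # year of last exam
    nvar: int      # number of time-dependent variables to select
    nav: int       # number of time-independent variables to select
    xinc: float    # increment of the frequency table


def parse_card1(card: str) -> Card1:
    """Slice Card 1 of INPUT by (starting column, length) and validate it."""
    card = card.ljust(80)  # pad short lines to a full 80-column card

    def field(start: int, length: int) -> str:
        # Layout columns are 1-based; Python slices are 0-based.
        return card[start - 1:start - 1 + length].strip()

    parsed = Card1(
        efail=float(field(1, 8)),
        cyfar=int(field(11, 2)),
        nvar=int(field(13, 4)),
        nav=int(field(17, 4)),
        xinc=float(field(21, 8)),
    )
    if not 5 <= parsed.nvar <= 9:
        raise ValueError("NVAR must be between 5 and 9")
    if not 0 <= parsed.nav <= 3:
        raise ValueError("NAV must be between 0 and 3")
    if parsed.nvar + parsed.nav > 9:
        raise ValueError("NVAR + NAV must not exceed 9")
    return parsed


# Hypothetical card image built field by field (columns 9-10 are unused).
example = f"{0.75:8.2f}{'':2}{79:2d}{6:4d}{0:4d}{0.0010:8.4f}"
print(parse_card1(example))
```

The checks mirror the limits stated in item (c), so a card that violates
them is rejected before any data records are read.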

2. 'DATAS' file: contains records of control group.

(a) Number of records of this file does not have to be known so long
as the limit is not exceeded.

(b) Number of cards per record: 2

(c) Layout of Card 1:

    Field  Length  Type     Variable
      1      9     Integer  ID or blank (not used in program)
     10      2     Integer  Year of birth
     12      2     Integer  Month of birth (not used in program)
     14      2     Integer  Day of birth (not used in program)
     16      2     Integer  No. of months survived after last exam
     18      2     Integer  Year of last exam prior to disease
     20      4     Integer  Time-independent variable 1
     24      4     Integer  Time-independent variable 2
     28      4     Integer  Time-independent variable 3

(d) Fields 16-19 of Card 1 are blank in this file.

(e) There are no time-independent variables in DATAS/DATAF at the present
time.

(f) Sample size: 553

(g) Total number of records in DATAS & DATAF may not exceed 570;
otherwise, the dimensions in the program must be increased.

(h) Layout of Card 2:

    Field  Length  Type  Variable
      1      8     Real  Time-dependent variable 1 of Year 1
      9      8     Real  Time-dependent variable 2 of Year 1
     17      8     Real  Time-dependent variable 3 of Year 1
     25      8     Real  Time-dependent variable 1 of Year 2
     33      8     Real  Time-dependent variable 2 of Year 2
     41      8     Real  Time-dependent variable 3 of Year 2
     49      8     Real  Time-dependent variable 1 of Year 3
     57      8     Real  Time-dependent variable 2 of Year 3
     65      8     Real  Time-dependent variable 3 of Year 3
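
A two-card record in this layout can be unpacked as follows. This Python
sketch is illustrative and is not the program's own input code: it assumes
the Field column is a 1-based starting column, treats blank columns 16-19 as
missing (as in DATAS), and uses made-up readings. The function name and
dictionary keys are hypothetical.

```python
def parse_record(card1: str, card2: str):
    """Split a two-card record into header fields and a 3 x 3 history."""
    card1 = card1.ljust(80)
    card2 = card2.ljust(80)

    def text(card, start, length):
        # Layout columns are 1-based; Python slices are 0-based.
        return card[start - 1:start - 1 + length].strip()

    def optional_int(card, start, length):
        raw = text(card, start, length)
        return int(raw) if raw else None  # blank in the control file

    header = {
        "year_of_birth": int(text(card1, 10, 2)),
        "months_survived": optional_int(card1, 16, 2),
        "year_of_last_exam": optional_int(card1, 18, 2),
    }
    # Card 2: nine 8-column reals, variables 1-3 for year 1, then year 2,
    # then year 3.
    values = [float(text(card2, 1 + 8 * k, 8)) for k in range(9)]
    history = [values[3 * year:3 * year + 3] for year in range(3)]
    return header, history


# Made-up control record: ID blank, born in '40, columns 16-19 blank.
card1 = f"{'':9}{40:2d}{6:2d}{15:2d}{'':4}"
card2 = "".join(f"{v:8.1f}" for v in
                [120, 80, 24.1, 124, 82, 24.3, 130, 85, 24.0])
header, history = parse_record(card1, card2)
print(header)
print(history)  # history[year][variable], both 0-based
```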


3.  'DATAF'  file:  contains  records  of  disease  group. 

(a)  See  DATAS  for  detail,  except  columns  16-19. 

(b)  Sample  size  for  this  data:  8 

Library  Routines 

We use one IMSL routine, LINV1F, in the program, which finds the inverse of
a matrix. Several Fortran functions are used. In particular, MNFLIB is
loaded together with IMSLIB.

Sample Job Setup

The example runs were done on a CDC Cyber 170 machine under the UT2D
operating system. The necessary commands are:

    READPF,(Tape Name),MAIN,INPUT,DATAS,DATAF.
    RFL,77700.
    MNF,I=MAIN.
    LOAD,LGO,MNFLIB,IMSLIB

which assumes the files are stored in some permanent file storage. The result
will be a file OUTPUT.

Extensions

1. The example run does not contain any time-independent variables.
Such data may be inserted in Card 1 of each record in DATAS/DATAF, with NAV
(number of additional variables) set accordingly.

2. The limit of NVAR + NAV <= 9 may be increased to 12 so that we can
test more variables. To do so, we need to redimension the following arrays:

    - Y                                          in Main
    - Y, XT                                      in Select
    - X, S, Beta, Betas, Mean, XTX, T1, G,
      GINV, T5                                   in LSQ

Also Maxvar has to be set to 12 in the main program.

3. The limit of no more than 570 records may be increased to any
reasonable number. Maxcas in Main has to be set to reflect the change. The
following arrays have to be redimensioned:

    - X, Y, Tau, Event, ADDV                     in Main
    - X, Tau, Y, Event, ADDV                     in Select
    - X, PI, P, 11, Tau, XOLD, SP, Event         in LSQ

4. The weights for variable selection may be changed by changing the
corresponding data cards in INPUT file. However, the second one cannot be
changed because it is used to compute the T-value of Age of last year.

5. Current block boundaries are 5, 9, 10, i.e.,

    Block 1: 1-5
    Block 2: 6-9
    Block 3: 10

This may be changed by changing the data on the 12th Card.

6. Names for time-dependent variables may be changed by using a different
data card at the end of INPUT file (no more than 4 characters per name).

FLOW CHARTS

General Structure of Algorithm
[Flow chart: START; initialization; mean and covariance of each group.]

Flow Chart 2.0 [Subroutine SELECT]
[Flow chart: input Yob, Month, Yoe, X(N2+1,I,J), I=1,3; J=1,3; at
end-of-file compute Mean 2 and N2 = N2 - N1; otherwise N2 = N2 + 1,
Event(N2) = 1, t(N2) = .04 + month * .08, X(N2,I,4) = Yoe - Yob - 3 + I,
I = 1,2,3.]

Flow Chart 2.1 [Subroutine SELECT]

Flow Chart 2.3 [Subroutine SELECT]
[Flow chart: compute Mean(1), Mean(2); divisor NCASE - 2;
Index = Index + 1.]

Flow Chart 3.2 [Subroutine LSQ]
[Flow chart: compute SP(i).]

Flow Chart 4.0 [Subroutine MAXT]

CONCLUSIONS FROM SAMPLE RUN

The sample run provided here combines eight cases of myocardial infarction
and 553 systematically selected control cases from two samples stratified by
age. The means of control cases for systolic and diastolic blood pressure
(SBP and DBP), body mass index (BMI; weight divided by height squared), and
age are presented. On the average, the disease group has higher SBP, higher
DBP, lower BMI, and is younger than the control group, though none of these
differences is significant.

The time-dependent variables SBP, DBP, and BMI were examined by means of
blocks of time-series transformations. The first block, seeking constant
trend, considers the 3-year average, the individual years, and an average
weighting the more recent years more heavily. The second block seeks a linear
trend and considers pairwise increments between the second and third years,
the first and third years, and the first and second years, plus the difference
between the average of the first 2 years and the last. The final
transformation considers quadratic trend. Age at last checkup is
automatically included, and t-tests are used to find the best separating
time-series transformations of SBP, DBP, and BMI. These are most recent SBP,
DBP from 2 years earlier, and most recent BMI. Since the option of six
time-dependent transformations was selected, the increment between the two
most recent SBP readings and the quadratic SBP were also selected by t-tests
to be included in the model.
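
The ten transformations and their three blocks can be written as rows of
weights applied to a 3-year series. The Python sketch below is an
illustration under stated assumptions: the weight values and their order are
hypothetical (the actual values sit on Cards 2-11 of INPUT), the readings are
made up, and picking the largest absolute pooled t-statistic within each
block is one plausible reading of "best separating", not necessarily the rule
the program applies.

```python
import math

# Hypothetical weight rows A(I,1), A(I,2), A(I,3) for years 1-3; the values
# and their order on the real INPUT cards may differ.
WEIGHTS = [
    (1 / 3, 1 / 3, 1 / 3),   # 1: 3-year average
    (0.0, 0.0, 1.0),         # 2: most recent year
    (0.0, 1.0, 0.0),         # 3: second year
    (1.0, 0.0, 0.0),         # 4: first year
    (1 / 6, 2 / 6, 3 / 6),   # 5: average weighted toward recent years
    (0.0, -1.0, 1.0),        # 6: increment, second to third year
    (-1.0, 0.0, 1.0),        # 7: increment, first to third year
    (-1.0, 1.0, 0.0),        # 8: increment, first to second year
    (-0.5, -0.5, 1.0),       # 9: last year minus average of first two
    (1.0, -2.0, 1.0),        # 10: quadratic trend (second difference)
]
BLOCK_BOUNDARIES = (5, 9, 10)  # blocks 1-5, 6-9, 10


def transform(series, weights):
    """Weighted sum of a 3-year series."""
    return sum(w * x for w, x in zip(weights, series))


def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance (n1 + n2 - 2 d.f.)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    if pooled_var == 0.0:
        return 0.0  # no spread at all: treat as non-separating
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def best_per_block(control, disease):
    """Return, per block, the 1-based transformation with the largest |t|."""
    chosen, start = [], 0
    for end in BLOCK_BOUNDARIES:
        scores = {}
        for i in range(start, end):
            scores[i + 1] = pooled_t(
                [transform(s, WEIGHTS[i]) for s in disease],
                [transform(s, WEIGHTS[i]) for s in control],
            )
        chosen.append(max(scores, key=lambda k: abs(scores[k])))
        start = end
    return chosen


# Made-up SBP series (year 1, year 2, year 3) for illustration only.
control = [(120, 120, 120), (118, 119, 121), (122, 121, 119), (121, 120, 120)]
disease = [(120, 125, 140), (119, 126, 138)]
selected = best_per_block(control, disease)
print(selected)
```

Writing each transformation as a weight row keeps the selection step
independent of which trends are being tested, which is why the weights and
block boundaries can live on data cards rather than in the program.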

Linear discriminant analysis is used to compute the initial estimate of
β. Three iterations of the Newton-Raphson procedure are required to find the
maximum likelihood estimates of β to an accuracy of 10^-6. Inspection of the
final estimate of β and the initial estimate given by discriminant analysis,
plus the rapid convergence, suggests that discriminant analysis provides a
reasonable initial value for β.
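
The Newton-Raphson step can be illustrated on a deliberately small case.
The Python sketch below is not the LSQ subroutine: it assumes a
constant-hazard likelihood with hazard lam0 * exp(beta * z), a single
covariate, a fixed lam0, and a starting value supplied by the caller rather
than by discriminant analysis. The data are made up, and the likelihood
actually maximized by the program may differ.

```python
import math


def score_and_information(beta, z, tau, event, lam0):
    """First derivative and negative second derivative of the log-likelihood

        sum_i [ event_i * (log lam0 + beta * z_i)
                - lam0 * exp(beta * z_i) * tau_i ]

    with respect to beta (assumed constant-hazard model).
    """
    score, information = 0.0, 0.0
    for zi, ti, di in zip(z, tau, event):
        cum_hazard = lam0 * math.exp(beta * zi) * ti
        score += (di - cum_hazard) * zi
        information += cum_hazard * zi * zi
    return score, information


def newton_raphson(z, tau, event, lam0, beta0=0.0, tol=1e-6, max_iter=50):
    """Iterate beta <- beta + score / information until the step is < tol."""
    beta = beta0
    for iteration in range(1, max_iter + 1):
        score, information = score_and_information(beta, z, tau, event, lam0)
        step = score / information
        beta += step
        if abs(step) < tol:
            return beta, iteration
    raise RuntimeError("Newton-Raphson did not converge")


# Made-up data: one covariate, two events among six subjects.
z = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
tau = [1.0, 1.0, 1.0, 1.0, 0.52, 0.28]
event = [0, 0, 0, 0, 1, 1]
beta_hat, iterations = newton_raphson(z, tau, event, lam0=0.3)
print(f"beta = {beta_hat:.6f} after {iterations} iterations")
```

Starting from zero, as here, generally costs more iterations than starting
from a good initial estimate, which is the practical argument for seeding the
procedure with the discriminant analysis result.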

The matrix GINV is the sample estimate for the Fisher information matrix
J(β). Using this result we may table the standardized z-score
($\beta_i\,[(\ldots)\,J_{ii}(\beta)]^{-1/2}$) for each β_i component of β.

     I      BETA(I)      VARIANCE    Z-SCORE

     1    -.12875809      .1522      -.9267
     2     .04446091      .0172       .9520
     3     .02388326      .0238       .4347
     4    -.18952202      .1931     -1.2109
     5    -.01121009      .0568      -.0132
     6     .01699793      .0146       .3951

We conclude that the disease group is composed of men first of lower recent
BMI, then of higher SBP while being younger, then of higher DBP at the
first examination with a convex quadratic trend, and inconsequentially of
recent decrease in SBP, all of this relative to the control group. Therefore
the data indicate that young slender men with high SBP, and to some extent
with a history of high DBP and a drop, then gain, in SBP are at risk.

The summary table of the estimates for λ0 and β is based on the expected
number of failures over one year for a sample of 561. Earlier we indicated
that 0.75 was a reasonable estimate of expected failures for this
sample; however, the various values only change the scale of the probability
of an event and not the relative positions. Therefore any reasonable expected
number of failures may be used. The model is checked by reusing the sample
to compute a frequency table for the estimates of the probability of an event
during the next year. We would expect that the estimated probabilities will
be higher than average for the disease group. Each individual's probability
and transformed data values are printed. The I column provides an index from
1 to 561, with the 553 control cases first, indicated by a 0 in the EVENT
column, and the eight disease cases, with a 1 under EVENT, last. The
probability of an event during the year between checkups is computed by


    \[ P[T_i < 1 \mid z_i(t_i)] = 1 - \exp\{-\lambda_0 e^{\beta^{T} z_i(t_i)}\} \]
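
The formula, and the remark that the expected number of failures changes
only the scale of the probabilities, can be checked numerically. The Python
sketch below is illustrative: the coefficients and covariate values are made
up, and choosing λ0 by bisection so that the probabilities sum to the
expected number of failures is an assumption about how the normalization
might be done, not a description of the program.

```python
import math


def event_probability(lam0, beta, z):
    """P[T < 1 | z] = 1 - exp(-lam0 * exp(beta . z))."""
    linear = sum(b * x for b, x in zip(beta, z))
    return 1.0 - math.exp(-lam0 * math.exp(linear))


def calibrate_lam0(beta, covariates, expected_failures, tol=1e-12):
    """Bisection for lam0 so that the probabilities sum to expected_failures.

    Assumption: one plausible way to tie lam0 to EFAIL; requires
    0 < expected_failures < number of subjects.
    """
    def total(lam0):
        return sum(event_probability(lam0, beta, z) for z in covariates)

    low, high = 0.0, 1.0
    while total(high) < expected_failures:
        high *= 2.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if total(mid) < expected_failures:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Made-up coefficients and centered covariates for five subjects.
beta = [0.04, -0.19]
covariates = [(10.0, 1.0), (-5.0, 0.5), (0.0, 0.0), (20.0, -2.0), (-12.0, 3.0)]


def ranking(expected_failures):
    lam0 = calibrate_lam0(beta, covariates, expected_failures)
    probs = [event_probability(lam0, beta, z) for z in covariates]
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    return probs, order


probs_a, order_a = ranking(0.75)
probs_b, order_b = ranking(1.50)
print(order_a == order_b)  # same relative positions under both scalings
```

Because the probability is an increasing function of the linear score for any
positive λ0, the ordering of subjects cannot depend on the value chosen.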


The probability and its frequency class are found in the columns headed S and
INn, respectively. The time after the most recent checkup to failure or
censoring is found in the TAU column; if the subject has no event, that is,
EVENT is 0, then TAU is 1.00. Otherwise TAU is .04 + (.08) (number of months
between last checkup and admission to hospital with acute myocardial
infarction). Finally, two frequency tables are constructed, one with constant
class size 0.0010 and the other with varying class size, to allow a more
detailed examination of the empirical distributions for both groups.
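
The TAU rule and the constant-class-size frequency table can be sketched as
follows. This Python fragment is illustrative: the class-index convention
(class k covers probabilities from k times the class size up to, but not
including, the next multiple) is an assumption, and the probabilities are
made up.

```python
def tau(event, months_to_event=0):
    """Time from the last checkup to failure or censoring, in years."""
    if event == 0:
        return 1.00  # censored at the next yearly checkup
    return 0.04 + 0.08 * months_to_event


def frequency_table(probabilities, events, class_size=0.0010):
    """Count [controls, disease cases] per probability class.

    Assumption: class k holds k * class_size <= p < (k + 1) * class_size.
    """
    table = {}
    for p, event in zip(probabilities, events):
        index = int(p / class_size)
        counts = table.setdefault(index, [0, 0])
        counts[event] += 1
    return dict(sorted(table.items()))


# Made-up probabilities: two controls (EVENT 0) and two cases (EVENT 1).
table = frequency_table([0.0004, 0.0014, 0.0016, 0.0031], [0, 0, 1, 1])
for index, (n_control, n_disease) in table.items():
    print(f"class {index}: controls={n_control} disease={n_disease}")
```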

The sample run does not allow one to conclude that the data used
satisfactorily separate the control and disease samples. However, some
success may be claimed if the sample reuse procedure can be believed. We
normalized the rate of events at .75/561 or 1.33/1000 failures on the average
over a year. Suppose we consider the cases with probability estimates greater
than .0014, a convenient class boundary close to the average probability.
Then 184/553 = 33.3% of the control group exceeded this critical value while
7/8 = 87.5% of the disease group exceeded it. It would be of interest to see
whether or not the control subjects of highest risk have been admitted for
acute myocardial infarction in the past year. However, the individual risk of
heart attack is sufficiently small that the 212 subjects of highest risk would
need to be considered before the chance of an event during this past year
exceeds 0.50.
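
The last calculation, how many of the highest-risk subjects must be pooled
before the chance of at least one event passes 0.50, can be sketched in
Python. The sketch assumes that events are independent across subjects, and
the probabilities it uses are made up rather than taken from the sample run;
the function name is hypothetical.

```python
def subjects_needed(probabilities, target=0.50):
    """Smallest k such that the k highest-risk subjects together have a
    chance of at least one event exceeding target (independence assumed).
    Returns None if the whole group never reaches the target.
    """
    no_event = 1.0
    for k, p in enumerate(sorted(probabilities, reverse=True), start=1):
        no_event *= 1.0 - p
        if 1.0 - no_event > target:
            return k
    return None


# Shares quoted in the text for the .0014 cutoff.
control_share = 100.0 * 184 / 553
disease_share = 100.0 * 7 / 8
print(f"{control_share:.1f}% of controls, {disease_share:.1f}% of cases")

# Made-up example: every control subject at the same yearly risk.
hypothetical = [0.0033] * 553
k = subjects_needed(hypothetical)
print(k)
```

With individual risks of a few per thousand, the pooled chance grows slowly,
so a large number of subjects is needed, in line with the count reported for
the sample run.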

Our final conclusion is that it is feasible to use a subject's medical
history to estimate his probability of an event. Problems with convergence of
the algorithm are overcome by establishing a good initial estimate, using an
effective procedure, and limiting the data to a subsample of the large control
population. An attempt to establish the full worth of the technique must
await a sample with a reasonably large number of cases.